Abstract

The exploration-exploitation dilemma has been an intriguing and un- 
solved problem within the framework of reinforcement learning. "Opti- 
mism in the face of uncertainty" and model building play central roles in 
advanced exploration methods. Here, we integrate several concepts and 
obtain a fast and simple algorithm. We show that the proposed algorithm 
finds a near-optimal policy in polynomial time, and give experimental 
evidence that it is robust and efficient compared to its ascendants. 

1 Introduction 

Reinforcement learning (RL) is the art of maximizing long-term rewards in a 
stochastic, unknown environment. In the construction of RL algorithms, the 
choice of exploration strategy is of central significance. 

We shall examine the problem of exploration in the Markov decision process
(MDP) framework. While simple methods like ε-greedy and Boltzmann exploration
are commonly used, it is known that their behavior can be extremely poor
(?). Recently, a number of efficient exploration algorithms have been published,
and for some of them, formal proofs of efficiency also exist. We review these
methods in Section 2. By combining ideas from several sources, we construct
a new algorithm for efficient exploration. The new algorithm, optimistic initial
model (OIM), is described in Section 3. In Section 4, we show that many of
the advanced algorithms, including ours, can be treated in a unified way. We
use this fact to sketch a proof that OIM finds a near-optimal policy in polynomial
time with high probability. Section 5 provides experimental comparison
between OIM and a number of other methods on some benchmark problems.
Our results are summarized in Section 6. In the rest of this section, we review
the necessary preliminaries, Markov decision processes and the exploration task.

1.1 Markov decision processes (MDPs) 

Markov decision processes are the standard framework for RL, and the basis 
of numerous extensions (like continuous MDPs, partially observable MDPs or 
factored MDPs). An MDP is characterized by a quintuple $(X, A, \mathcal{R}, P, \gamma)$, where
$X$ is a finite set of states; $A$ is a finite set of possible actions; $\mathcal{R}$ assigns a
reward distribution to each triple in $X \times A \times X$, and $R(x,a,y)$ denotes the mean
value of $\mathcal{R}(x,a,y)$; $P : X \times A \times X \to [0,1]$ is the transition function; and
finally, $\gamma \in [0,1)$ is the discount rate on future rewards. We shall assume that all
rewards are nonnegative and bounded from above by $R^0_{\max}$.

A (stationary) policy of the agent is a mapping $\pi : X \times A \to [0,1]$. For
any $x_0 \in X$, the policy of the agent and the parameters of the MDP determine
a stochastic process experienced by the agent through the instantiation
$x_0, a_0, r_0, x_1, a_1, r_1, \ldots, x_t, a_t, r_t, \ldots$

The goal is to find a policy that maximizes the expected value of the discounted
total reward. Let us define the state-action value function (value function
for short) of $\pi$ as

$$Q^\pi(x,a) := E\Big(\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x = x_0, a = a_0\Big)$$

and the optimal value function as

$$Q^*(x,a) := \max_{\pi} Q^\pi(x,a)$$

for each $(x,a) \in X \times A$. Let the greedy action at $x$ w.r.t. value function $Q$
be $a^Q_x := \arg\max_a Q(x,a)$. The greedy policy of $Q$ deterministically takes the
greedy action in each state. It is well-known that the greedy policy of $Q^*$ is an
optimal policy and $Q^*$ satisfies the Bellman equations:

$$Q^*(x,a) = \sum_{y} P(x,a,y)\left(R(x,a,y) + \gamma Q^*(y, a^{Q^*}_y)\right).$$
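As a small illustration of the Bellman equations, the following Python sketch solves them by repeated backups on a made-up MDP whose parameters are known. The function name `q_value_iteration`, the nested-list layout `P[x][a][y]` and `R[x][a][y]`, and the two-state example are assumptions chosen for illustration; they do not come from the paper.

```python
def q_value_iteration(P, R, gamma, tol=1e-10, max_iter=100_000):
    """Solve the Bellman optimality equations by repeated backups.

    P[x][a][y] is the transition probability and R[x][a][y] the mean reward.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        V = [max(row) for row in Q]  # value of the greedy action in every state
        Q_new = [
            [
                sum(P[x][a][y] * (R[x][a][y] + gamma * V[y]) for y in range(n_states))
                for a in range(n_actions)
            ]
            for x in range(n_states)
        ]
        change = max(
            abs(Q_new[x][a] - Q[x][a])
            for x in range(n_states)
            for a in range(n_actions)
        )
        Q = Q_new
        if change < tol:
            break
    return Q


# Hypothetical 2-state, 2-action MDP: action 0 leads to state 0, action 1 to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
# Staying in state 0 pays 1, staying in state 1 pays 2, switching pays 0.
R = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]

Q_star = q_value_iteration(P, R, gamma=0.9)
greedy_policy = [max(range(2), key=lambda a: Q_star[x][a]) for x in range(2)]
```

In this toy example the greedy policy moves to state 1 and stays there, since the larger per-step reward outweighs the one-step delay.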



1.2 The exploration problem 

In the classical reinforcement learning setting, it is assumed that the environ- 
ment can be modelled as an MDP, but its parameters (that is, P and R) are 
unknown to the agent, and she has to collect information by interacting with the 
environment. If too little time is spent with the exploration of the environment, 
the agent will get stuck with a suboptimal policy, without knowing that there 
exists a better one. On the other hand, the agent should not spend too much 
time visiting areas with low rewards and/or accurately known parameters. 

What is the optimal balance between exploring and exploiting the acquired
knowledge and how could the agent concentrate her exploration efforts? These
questions are central for RL. It is known that the optimal exploration policy in
an MDP is non-Markovian, and can be computed only for very simple tasks like
$k$-armed bandit problems.



2 Related literature 

Here we give a short review of some of the most important exploration
methods and their properties.
2.1 ε-greedy and Boltzmann exploration

The most popular exploration method is ε-greedy action selection. The method
works without a model; only an approximation of the action value function
$Q(x,a)$ is needed. The agent in state $x$ selects the greedy action $a^Q_x$ with
probability $1 - \epsilon$, or makes an explorative move with a random action with
probability $\epsilon$. Sooner or later, all paths with nonzero probability will have
been visited many times, so a suitable learning algorithm can learn to choose the
optimal path. It is known, for example, that Q-learning with nonzero exploration
converges to the optimal value function with probability 1 (?), and so does SARSA (?),
if the exploration rate diminishes according to an appropriate schedule.

Boltzmann exploration selects actions as follows: the probability of choosing
action $a$ in state $x$ is

$$\frac{\exp(Q(x,a)/T)}{\sum_{a' \in A} \exp(Q(x,a')/T)},$$

where the 'temperature' $T$ ($> 0$) regulates the amount of explorative actions.
Convergence results of the ε-greedy method carry through to this case.

Unfortunately, for the ε-greedy and the Boltzmann method, exploration time
may scale exponentially in the number of states (?).
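The two selection rules of this section can be sketched in a few lines of Python. The function names and the `rng` parameter are illustrative assumptions; the Boltzmann weights are shifted by the largest Q-value, which leaves the probabilities unchanged and only avoids overflow.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probabilities(q_values, temperature):
    """Probabilities proportional to exp(Q(x, a) / T)."""
    q_max = max(q_values)  # shifting by the maximum does not change the ratios
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, temperature, rng=random):
    """Sample one action index from the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A low temperature makes the rule almost greedy, while a high temperature makes it almost uniform, which mirrors the role of $\epsilon$ in the first rule.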

2.2 Optimistic initial values (OIV) 

One may boost exploration with a simple trick: the initial value of each state 
action pair can be set to some overwhelmingly high number. If a state x is 
visited often, then its estimated value will become more exact, and therefore, 
lower. Thus, the agent will try to reach the more rarely visited areas, where 
the estimated state values are still high. This method, called 'exploring starts' 
or 'optimistic initial values', is a popular exploration heuristic (?), sometimes
combined with others, e.g., the ε-greedy exploration method. Recently, ? (?)
gave theoretical justification for the method: they proved that if the optimistic
initial values are sufficiently high, Q-learning converges to a near-optimal solution.
One apparent disadvantage of OIV is that if the initial estimates are too
high, then it takes a long time to correct them.
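The effect can be seen in a single Q-learning step. The sketch below is only an illustration under assumed values (initial value 100, learning rate 0.5, discount 0.9); the helper names are hypothetical.

```python
def optimistic_q_table(n_states, n_actions, q_init):
    """Q-table in which every state-action pair starts at the same high value."""
    return [[q_init] * n_actions for _ in range(n_states)]


def q_learning_update(Q, x, a, r, y, alpha, gamma):
    """One Q-learning step: move Q(x, a) toward r + gamma * max_a' Q(y, a')."""
    target = r + gamma * max(Q[y])
    Q[x][a] += alpha * (target - Q[x][a])
    return Q[x][a]


Q = optimistic_q_table(n_states=2, n_actions=2, q_init=100.0)
# Trying action 0 in state 0 yields a modest reward and lowers its estimate ...
q_learning_update(Q, x=0, a=0, r=1.0, y=0, alpha=0.5, gamma=0.9)
# ... so the untried action 1 now looks better and is selected next.
next_action = max(range(2), key=lambda a: Q[0][a])
```

Because each update removes only part of the initial surplus, a very large initial value needs many visits to wear off, which is the disadvantage noted above.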

2.3 Bayesian methods 

We may assume that the MDP (with the unknown values of $P$ and $R$) is drawn
from a parameterized distribution $\mathcal{M}_0$. From the collected experience and
the prior distribution $\mathcal{M}_0$, we can calculate successive posterior distributions
$\mathcal{M}_t$, $t = 1, 2, \ldots$ by Bayes' rule. Furthermore, we can calculate (at least in
principle) the policy that minimizes the uncertainty of the parameters (?). ? (?)
approximates the distribution of state values directly. Exact computation of the 
optimal exploration policy is infeasible and Bayesian methods are computation- 
ally demanding even with simplifying assumptions about the distributions, e.g., 
the independencies of certain parameters. 
2.4 Confidence interval estimation

Confidence interval estimation algorithms lie between Bayesian exploration and
OIV. They assume that each state value is drawn from an independent Gaussian
distribution and compute the confidence interval of the state values. The
agent chooses the action with the highest upper confidence bound. Initially, all
confidence intervals are very wide, and shrink gradually towards the true state
values. Therefore, the behavior of the technique is similar to OIV. The IEQL+
method of ? (?) directly estimates confidence intervals of Q-values, while ?
(?) calculate confidence intervals for $P$ and $R$, and obtain Q-value bounds
indirectly. ? (?) improve the method and prove a polynomial-time convergence
bound. Both algorithms are called model-based interval estimation. To avoid
confusion, we will refer to them as MBIE(WS) and MBIE(SL).

? (?) give a confidence interval-based algorithm, for which the online regret 
is only logarithmic in the number of steps taken. 

2.5 Exploration Bonus Methods 

The agent can be directed towards less-known parts of the state space by
increasing the value of 'interesting' states artificially with bonuses. States can be
interesting given their frequency, recency, error, etc. (?; ?).

The balance of exploration and exploitation is usually set by a scaling
factor $\kappa$, so that the total immediate reward of the agent at time $t$ is
$r_t + \kappa \cdot b_t(x_t, a_t, x_{t+1})$, where $b_t$ is one of the above listed bonuses. The bonuses
are calculated by the agent and act as intrinsic motivating forces. Exploration
bonuses for a state can vary swiftly, and model-based algorithms (like prioritized
sweeping or Dyna) are used for spreading the changes effectively. Alas,
the weight of exploration $\kappa$ needs to be annealed according to a suitable schedule.

Alternatively, the agent may learn two value functions separately: a regular
one, $Q^r_t$, which is based on the rewards $r_t$ received from the environment, and
an exploration value function $Q^e_t$, which is based on the exploration bonuses.
The agent's policy will be greedy with respect to their combination $Q^r_t + \kappa Q^e_t$.
Then the exploration mechanism may remain the same, but several advantages
appear. First of all, the changes in $\kappa$ take effect immediately. As an example,
we can immediately switch off exploration by setting $\kappa$ to 0. Furthermore, $Q^r_t$
may converge even if $Q^e_t$ does not.

Confidence interval estimation can be phrased as an exploration bonus
method: see IEQL+ (?) or MBIE-EB (?). ? (?) have shown that ε-greedy
and Boltzmann explorations can be formulated as exploration bonus methods
although rewards are not propagated through the Bellman equations.

2.6 $E^3$ and R-max

The Explicit Explore or Exploit ($E^3$) algorithm of ? (?) and its successor, R-max
(?), were the first algorithms with polynomial time bounds for finding near-optimal
policies. R-max collects statistics about transitions and rewards. When
visits to a state enable high-precision estimation of the real transition probabilities
and rewards, the state is declared known. R-max also maintains an approximate
model of the environment. Initially, the model assumes that all actions in all
states lead to a (hypothetical) maximum-reward absorbing state. The model
is updated each time a state becomes known. The optimal policy of the
model either is near-optimal in the real environment or enters a not-yet-known
state and collects new information.



3 Construction of the algorithm 

Our agent starts with a simple, but overly optimistic model. By collecting 
new experiences, she updates her model, which becomes more realistic. The 
value function is computed over the approximate model with (asynchronous) 
dynamic programming. The agent always chooses her action greedily w.r.t. her 
value function. Exploration is induced by the optimism of the model: unknown 
areas are believed to yield large rewards. Algorithmic components are detailed 
below. 

Separate exploration values. Similarly to the approach of ? (?), we
shall separate the 'true' state values from exploration values. Formally, the
value function has the form

$$Q(x,a) = Q^r(x,a) + Q^e(x,a)$$

for all $(x,a) \in X \times A$, where $Q^r$ and $Q^e$ will summarize external and exploration
rewards, respectively.

'Garden of Eden' state. Similarly to R-max, we introduce a new hypothetical
'garden of Eden' state $x_E$, and assume an extended state space
$X' = X \cup \{x_E\}$. Once there, then, according to the inherited model, the agent
remains in $x_E$ indefinitely and receives $R_{\max}$ reward for every step, which may
exceed $R^0_{\max} := \max_{x,a,y} R(x,a,y)$, the maximal reward of the original environment.

Model approximation. The agent builds an approximate model of the
environment. For each $x, y \in X$ and $a \in A$, let $N_t(x,a)$, $N_t(x,a,y)$, and
$C_t(x,a,y)$ denote the number of times when $a$ was selected in $x$ up to step $t$,
the number of times when transition $x \xrightarrow{a} y$ was experienced, and the sum of
external rewards for $x \xrightarrow{a} y$ transitions, respectively. With these notations, the
approximate model parameters are

$$\hat{P}_t(x,a,y) = \frac{N_t(x,a,y)}{N_t(x,a)} \quad\text{and}\quad \hat{R}_t(x,a,y) = \frac{C_t(x,a,y)}{N_t(x,a,y)}.$$

Suitable initializations of $N_t(x,a)$, $N_t(x,a,y)$ and $C_t(x,a,y)$ will ensure that
the ratios are well-defined everywhere. The exploration rewards are defined as

$$R^e(x,a,y) := \begin{cases} R_{\max}, & \text{if } y = x_E; \\ 0, & \text{if } y \neq x_E, \end{cases}$$

for each $x, y \in X \cup \{x_E\}$, $a \in A$, and are not modified during the course of
learning.

Optimistic initial model. The initial model assumes that $x_E$ has been
reached once for each state-action pair: for each $x \in X \cup \{x_E\}$, $y \in X$ and
$a \in A$,

$$N_0(x,a) = 1, \quad N_0(x,a,y) = 0, \quad C_0(x,a,y) = 0, \quad N_0(x,a,x_E) = 1, \quad C_0(x,a,x_E) = 0.$$

Then, the optimal initial value function equals

$$Q_0(x,a) = Q^r_0(x,a) + Q^e_0(x,a) = 0 + \frac{1}{1-\gamma} R_{\max} := V_{\max}$$

for each $(x,a) \in X' \times A$, analogously to OIV.
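The value $V_{\max}$ can be checked directly. Under the initial model every state-action pair leads to $x_E$ with probability 1, and the agent then collects $R_{\max}$ in every step; assuming, as a reading of the definitions above, that the exploration reward is already received on the first transition into $x_E$, the geometric series gives

```latex
Q^e_0(x,a) \;=\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;=\; V_{\max},
\qquad
Q^r_0(x,a) \;=\; 0 .
```

For instance, with the illustrative values $\gamma = 0.95$ and $R_{\max} = 10$ (not taken from the experiments), every pair starts at $V_{\max} = 10 / 0.05 = 200$.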

Dynamic programming. Both value functions can be updated using the
approximate model. For each $x \in X$, let $a_x$ be the greedy action according to
the combined value function, i.e.,

$$a_x := \arg\max_{a \in A}\left(Q^r(x,a) + Q^e(x,a)\right).$$

The dynamic programming equations for the value function components are

$$Q^r_{t+1}(x,a) := \sum_{y \in X} \hat{P}_t(x,a,y)\left(\hat{R}_t(x,a,y) + \gamma Q^r_t(y, a_y)\right),$$

$$Q^e_{t+1}(x,a) := \gamma \sum_{y \in X} \hat{P}_t(x,a,y)\, Q^e_t(y, a_y) + \hat{P}_t(x,a,x_E)\, V_{\max}.$$

Episodic tasks can be handled in the usual way: we introduce an absorbing final
state with zero external reward.

Asynchronous update. The algorithm can be made online if, instead of full
update sweeps over the state space, updates are limited to a state set $L_t$ in the
'neighborhood' of the agent's current state. The neighborhood is restricted by
computation time constraints; any asynchronous dynamic programming algorithm
suffices. It is implicitly assumed that the current state is always updated, i.e.,
$x_t \in L_t$. In this paper, we used the improved prioritized sweeping algorithm of
? (?).

Putting it all together. The method is summarized as Algorithm 1. 

4 Analysis 

In the first part of this section, we analyze the similarities and differences be- 
tween various exploration methods, with an emphasis on OIM. Based on this 
analysis, we sketch the proof that OIM finds a near-optimal policy in polynomial 
time. 






Algorithm 1 The optimistic initial model algorithm

Input: $x_0 \in X$ initial state, $\epsilon > 0$ required precision, optimism parameter $R_{\max}$
Model initialization: $t := 0$; $\forall x, y \in X$, $\forall a \in A$:
    $N(x,a,y) := 0$, $N(x,a,x_E) := 1$, $N(x,a) := 1$, $C(x,a,y) := 0$,
    $Q^r(x,a) := 0$, $Q^e(x,a) := R_{\max}/(1-\gamma)$;
repeat
    $a_t$ := greedy action w.r.t. $Q^r + Q^e$; apply $a_t$ and observe $r_t$ and $x_{t+1}$
    $C(x_t,a_t,x_{t+1}) := C(x_t,a_t,x_{t+1}) + r_t$; $N(x_t,a_t,x_{t+1}) := N(x_t,a_t,x_{t+1}) + 1$;
    $N(x_t,a_t) := N(x_t,a_t) + 1$
    $L_t$ := list of states to be updated
    for each $x \in L_t$ do
        $Q^r_{t+1}(x,a) := \sum_{y \in X} \hat{P}(x,a,y)\left(\hat{R}(x,a,y) + \gamma Q^r_t(y,a_y)\right)$
        $Q^e_{t+1}(x,a) := \hat{P}(x,a,x_E)\, R_{\max}/(1-\gamma) + \gamma \sum_{y \in X} \hat{P}(x,a,y)\, Q^e_t(y,a_y)$
    end for
    $t := t + 1$
until Bellman error $\le \epsilon$
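The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it performs full synchronous sweeps instead of prioritized sweeping, it keeps the Garden of Eden count implicit (it is always 1, so $\hat{P}(x,a,x_E) = 1/N(x,a)$), and the class and method names are hypothetical.

```python
class OIMAgent:
    """Sketch of the optimistic initial model with full DP sweeps."""

    def __init__(self, n_states, n_actions, r_max, gamma):
        self.nS, self.nA, self.gamma = n_states, n_actions, gamma
        self.v_max = r_max / (1.0 - gamma)
        # N(x, a) starts at 1 because the fictitious visit to x_E is counted.
        self.N = [[1] * n_actions for _ in range(n_states)]
        self.Nxy = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
        self.C = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
        self.Qr = [[0.0] * n_actions for _ in range(n_states)]
        self.Qe = [[self.v_max] * n_actions for _ in range(n_states)]

    def greedy(self, x):
        """Greedy action w.r.t. the combined value Q^r + Q^e."""
        return max(range(self.nA), key=lambda a: self.Qr[x][a] + self.Qe[x][a])

    def observe(self, x, a, r, y):
        """Update the counters after one real transition (x, a) -> y with reward r."""
        self.C[x][a][y] += r
        self.Nxy[x][a][y] += 1
        self.N[x][a] += 1

    def sweep(self, states=None):
        """One synchronous backup of Q^r and Q^e for the given states (default: all)."""
        states = range(self.nS) if states is None else states
        a_greedy = [self.greedy(y) for y in range(self.nS)]
        new_r, new_e = {}, {}
        for x in states:
            for a in range(self.nA):
                n = self.N[x][a]
                qr, qe = 0.0, self.v_max / n  # P_hat(x, a, x_E) = 1 / N(x, a)
                for y in range(self.nS):
                    n_xy = self.Nxy[x][a][y]
                    if n_xy == 0:
                        continue  # unseen successors have zero estimated probability
                    p_hat = n_xy / n
                    r_hat = self.C[x][a][y] / n_xy
                    qr += p_hat * (r_hat + self.gamma * self.Qr[y][a_greedy[y]])
                    qe += p_hat * self.gamma * self.Qe[y][a_greedy[y]]
                new_r[(x, a)], new_e[(x, a)] = qr, qe
        for (x, a), value in new_r.items():
            self.Qr[x][a] = value
            self.Qe[x][a] = new_e[(x, a)]


agent = OIMAgent(n_states=2, n_actions=2, r_max=10.0, gamma=0.5)
agent.observe(0, 0, r=1.0, y=0)  # action 0 in state 0 loops back with reward 1
agent.sweep()
```

After this single observation the combined value of the tried pair drops below $V_{\max}$, so the greedy choice in state 0 switches to the untried action; no explicit exploration rule is involved.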



4.1 Relationship to other methods 

'Optimism in the face of uncertainty' is a common point in exploration methods: 
the agent believes that she can obtain extra rewards by reaching the unexplored 
parts of the state space. 

Note that as far as the combined value function Q is concerned, OIM is an 
asynchronous dynamic programming method augmented with model approxi- 
mation. 

Optimistic initial values. Apparently, OIM is the model-based extension 
of the OIV heuristic. Note however, that optimistic initialization of Q-values is
not effective with a model: the more updates are made, the less effect the
initialization has, and it fully diminishes if value iteration is run until convergence.
Therefore, naive combination of OIV and model construction is contradictory: 
the number of DP-updates should be kept low in order to save the initial boost, 
but it should be as high as possible in order to propagate the real rewards 
quickly. 

OIM resolves this paradox by moving the optimism into the model. The
optimal value function of the initial model is $Q_0 = V_{\max}$, corresponding to
OIV. However, DP updates cannot lower the exploration boost; only model
updates can.

Note that we can set the initial model value as high as we like, but we do not
have to wait until the initial boost diminishes, because $Q^r$ and $Q^e$ are separated.

R-max. The 'Garden of Eden' state $x_E$ of OIM is identical to the fictitious
max-reward absorbing state of R-MAX (and $E^3$). In both cases, the agent's
model tells that all unexplored $(x,a)$ pairs lead to $x_E$. R-MAX, however, updates
the model only when the transition probabilities and rewards are known with
high precision, which is only after many visits to $(x,a)$. In contrast, OIM
updates the model after each single visit, employing each bit of experience as
soon as it is obtained. As a result, the approximate model can be used long
before it becomes accurate.

Exploration bonus methods. The extra reward offered by the Garden
of Eden state can be understood as an exploration bonus: for each visit of
the pair $(x,a)$, the agent gets the bonus $b_t(x,a) = \frac{1}{N_t(x,a)}\left(V_{\max} - Q_t(x,a)\right)$.
It is insightful to contrast this formula with those of the other methods like
the frequency-based bonus $b_t = -\alpha \cdot N_t(x,a)$ or the error-based bonus
$b_t = \alpha \cdot |Q_{t+1}(x,a) - Q_t(x,a)|$.

Model-based interval exploration. The exploration bonus form of the
MBIE method of ? (?) sets $b_t = \frac{\beta}{\sqrt{N_t(x,a)}}$. MBIE-EB is not an ad-hoc method:
the form of the bonus comes from confidence interval estimations. The comparison
to MBIE-EB will be especially valuable, as it converges in polynomial time
and the proof can be carried over to OIM with slight modifications.



4.2 Polynomial-time convergence 



Theorem 4.1 For any e > 0, i5 > 0, ei := e/6, €2 ■= ^xi^i^+ro — ) ' ^i' ^ ■~ 
In j^^, m := ^™'^^{ym»x} jj^ 8^ OIM converges almost surely to a near- 
optimal policy in polynomial time if started with Rmax = ^^'^■""■'^^'"^^^l^]!"^!"'^^^ , 
that is, with probability I — S, the number of timesteps where {xt,at) > 

Q*{xt, at) — e does not hold, is at most In |- 

The proof can be found in the Appendix. 



5 Experiments 

To assess the practical utility of OIM, we compared its performance to other 
exploration methods. Experiments were run on several small benchmark tasks 
challenging exploration algorithms. 

For fair comparisons, benchmark problems were taken from the literature 
without changes, nor did we change the experimental settings or the presentation 
of experimental data. It also means that the presentation format varies for 
different benchmarks. 

5.1 RiverSwim and SixArms

The first two benchmark problems, RiverSwim and SixArms, were taken from
? (?).

Table 1: Results on the RiverSwim task.

Method      Cumulative reward
$E^3$       $3.020 \cdot 10^6 \pm 0.027 \cdot 10^6$
R-MAX       $3.014 \cdot 10^6 \pm 0.039 \cdot 10^6$
MBIE(SL)    $3.168 \cdot 10^6 \pm 0.023 \cdot 10^6$
MBIE-EB     $3.093 \cdot 10^6 \pm 0.023 \cdot 10^6$
OIM         $3.201 \cdot 10^6 \pm 0.016 \cdot 10^6$

Table 2: Results on the SixArms task.

Method      Cumulative reward
$E^3$       $1.623 \cdot 10^6 \pm 0.244 \cdot 10^6$
R-MAX       $2.819 \cdot 10^6 \pm 0.256 \cdot 10^6$
MBIE(SL)    $9.205 \cdot 10^6 \pm 0.559 \cdot 10^6$
MBIE-EB     $9.486 \cdot 10^6 \pm 0.587 \cdot 10^6$
OIM         $10.007 \cdot 10^6 \pm 0.654 \cdot 10^6$

The RiverSwim MDP has 6 states, representing the position of the agent in
a river. The agent has two possible actions: she can swim either upstream or
downstream. Swimming down is always successful, but swimming up succeeds
only with a 30% chance and there is a 10% chance of slipping down. The
lowermost position yields +5 reward per step, while the uppermost position
yields +10000.
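The RiverSwim dynamics described in the text can be written down as a transition table. The sketch below makes two assumptions that the description leaves open: the remaining 60% of the upstream probability keeps the agent in place, and moves that would leave the river at either end keep her in the boundary state. The original benchmark may treat the end states differently, and the function name is hypothetical.

```python
def riverswim_transitions(n_states=6, p_up=0.3, p_slip=0.1):
    """Return P[x][a][y] with actions 0 = downstream and 1 = upstream."""
    DOWN, UP = 0, 1
    P = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    for x in range(n_states):
        lower, upper = max(x - 1, 0), min(x + 1, n_states - 1)
        P[x][DOWN][lower] = 1.0             # swimming down always succeeds
        P[x][UP][upper] += p_up             # progress one position upstream
        P[x][UP][lower] += p_slip           # slip one position downstream
        P[x][UP][x] += 1.0 - p_up - p_slip  # otherwise stay in place (assumed)
    return P


P_river = riverswim_transitions()
```

With these numbers, reaching the uppermost position from the bottom needs at least five successful upstream moves, which is why undirected exploration rarely discovers the large reward.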

The SixArms MDP consists of a central state and six 'payoff states'. In the 
central state, the agent can play 6 one-armed bandits. If she pulls arm k and 
wins, she is transferred to payoff state k. Here, she can get a reward in each 
step, if she chooses the appropriate action. The winning probabilities range 
from 1 to 0.01, while the rewards range from 50 to 6000 (for the exact values, 
see ?). 

Data for $E^3$, R-MAX, MBIE and MBIE-EB are taken from ? (?). Parameters
of all four algorithms were chosen optimally. Following a coarse search in
parameter space, the $R_{\max}$ parameter for OIM was set to 2000 for RiverSwim
and to 10000 for SixArms. Since the state spaces are small, value iteration was
run to completion in each step instead of prioritized sweeping.

On both problems, each algorithm ran for 5000 time steps and the undiscounted
total reward was recorded. The averages and 95% confidence intervals
are calculated over 1000 test runs (Tables 1 and 2).

5.2 50 × 50 maze with subgoals

Another benchmark problem, MazeWithSubgoals, was suggested by ? (?). The
agent has to navigate in a 50 × 50 maze from the start position at (2, 2) to the
goal (with +1000 reward) at the opposite corner (49, 49). There are suboptimal
goals (with +500 reward) at the other two corners. The maze has blocked places
and punishing states ($-10$ reward), each set randomly in 20% of the squares. The
agent can move in four directions, but with a 10% chance, its action is replaced
by a random one. If the agent tries to move to a blocked state, it gets a reward
of $-2$. Reaching any of the goals resets the agent to the start state. In all other
cases, the agent gets a $-1$ reward for each step.

Table 3: Results on the MazeWithSubgoals task. The number of steps required
to learn p-optimal policies (p = 0.95, 0.99, 0.998) on the 50 × 50 maze task with
suboptimal goals. In parentheses: how many runs out of 20 have found the goal.
'k' stands for 1000.

Method                95%         99%         99.8%
ε-greedy, ε = 0.2     - (0)       - (0)       - (0)
ε-greedy, ε = 0.4     43k (4)     52k (4)     68k (4)
Recency-bonus         27k (19)    55k (18)    69k (9)
Freq.-bonus           24k (20)    50k (16)    66k (10)
MBIE(WS)              25k (20)    42k (19)    66k (18)
OIM                   19k (20)    29k (20)    31k (20)

Each algorithm was run on 20 different mazes for 100,000 steps. After every
1000 steps, we tested the learned value functions by averaging 20 test runs, in
each one following the greedy policy for 10,000 steps, and averaging cumulated
(undiscounted) rewards. We measured how many of the runs learned to collect
95%, 99% and 99.8% of the maximum possible rewards within 100,000 steps, and
the number of steps this took on average in the runs that met the challenge.

The algorithms that we compared were the recency-based and frequency-based
exploration bonus methods, two versions of ε-greedy exploration,
MBIE(WS) and OIM. All exploration rules applied the improved prioritized
sweeping of ? (?). OIM's $R_{\max}$ was set to 1000. The results are summarized
in Table 3.

5.3 Chain, Loop and FlagMaze 

The next three benchmark MDPs, the Chain, Loop and FlagMaze tasks, were
investigated, e.g., by ? (?), ? (?) and ? (?). In the Chain task, 5 states are
lined up along a chain. The agent gets +2 reward for being in state 1 and +10
for being in state 5. One of the actions advances one state ahead, the other one
resets the agent to state 1. The Loop task has 9 states in two loops (arranged in
a figure-8 shape). Completing the first loop (using any combination of the two actions)
yields +1 reward, while the second loop yields +2, but one of the actions resets
the agent to the start. The FlagMaze task consists of a 6 × 7 maze with several
walls, a start state, a goal state and 3 flags. Whenever the agent reaches the
goal, her reward is the number of flags collected.

The following algorithms were compared: Q-learning with variance-based
and TD error-based exploration bonus (model-free variants), ε-greedy exploration,
Boltzmann exploration, IEQL+, Bayesian Q-learning, Bayesian DP and
OIM. Data were taken from ? (?), ? (?) and ? (?). According to the sources,
parameters for all algorithms were set optimally. OIM's $R_{\max}$ parameter was
set to 0.5, 10 and 0.005 for the three tasks, respectively.

Each algorithm ran for 8 learning phases. The total cumulated reward over
each learning phase was measured. One phase lasted for 1000 steps for the first
two tasks and 20,000 steps for the FlagMaze task. We carried out 256 parallel
runs for the first 2 tasks and 20 for the third one.

Table 4: Average accumulated rewards on the Chain task. Optimal policy
gathers 3677.

Method             Phase 1    Phase 2    Phase 8
QL+var.-bonus      -          2570¹      -
QL+err.-bonus      -          2530¹      -
QL ε-greedy        1519       1611       1602
QL Boltzmann       1606       1623       -
IEQL+              2344       2557       -
Bayesian QL        1697       2417       -
Bayesian DP²       3158       3611       3643
OIM                3510       3628       3643

Table 5: Average accumulated rewards on the Loop task. Optimal policy gathers
400.

Method             Phase 1    Phase 2    Phase 8
QL+var.-bonus      -          179¹       -
QL+err.-bonus      -          179¹       -
QL ε-greedy        337        392        399
QL Boltzmann       186        200        -
IEQL+              264        293        -
Bayesian QL        326        340        -
Bayesian DP²       377        397        399
OIM                393        400        400

Table 6: Average accumulated rewards on the FlagMaze task. Optimal policy
gathers approximately 1890.

Method             Phase 1    Phase 2    Phase 8
QL ε-greedy        655        1135       1147
QL Boltzmann       195        1024       -
IEQL+              269        253        -
Bayesian QL        818        1100       -
Bayesian DP²       750        1763       1864
OIM                1133       1169       1171

¹ Results for Phase 5.

² Augmented with limited amount of pre-wired knowledge (the list of successor
states).



6 Summary of the results 

We proposed a new algorithm for exploration and reinforcement learning in 
Markov decision processes. The algorithm integrates concepts from other advanced
exploration methods. The key component of our algorithm is an optimistic
initial model. The optimal policy according to the agent's model
either gathers new information that helps to make the model more accurate, or
follows a near-optimal path. The extent of optimism regulates the amount of
exploration. We have shown that with a suitably optimistic initialization, our 
algorithm finds a near-optimal policy in polynomial time. Experiments were 
conducted on a number of benchmark MDPs. According to the experimental 
results our novel method is robust and compares favorably to other methods. 
A Proof of polynomial-time convergence

For the proof, we shall follow the technique of ? (?) and ? (?), and will use 
the shorthands [KS] and [SL] for referring to them. We will proceed by a series 
of lemmas. 

Throughout the proof, note the difference between $R_{\max}$ and $R^0_{\max}$. Value
estimates of our model start from $R_{\max}$. However, all actual rewards observed
by the agent are bounded by $R^0_{\max}$, which is smaller than $R_{\max}$.

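Lemma A.1 (Azuma's Lemma) If the random variables $X_1, X_2, \ldots$ form a
martingale difference sequence, meaning that $E[X_k \mid X_1, X_2, \ldots, X_{k-1}] = 0$ for
all $k$, and $|X_k| \le b$ for each $k$, then

$$\Pr\left[\sum_{i=1}^{k} X_i > a\right] \le \exp\left(-\frac{a^2}{2 b^2 k}\right)$$

and

$$\Pr\left[\left|\sum_{i=1}^{k} X_i\right| > a\right] \le 2 \exp\left(-\frac{a^2}{2 b^2 k}\right).$$
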

The following lemma is similar to Lemma 5 of [KS] (with the modification
that $R(x,a,y)$ values are learnt instead of $R(x,a)$ values), and tells that if a
state-action pair is visited many times, then its parameter estimates become
accurate.

Lemma A.2 Consider an MDP $M = (X, A, P, R, \gamma)$, and let $(x,a)$ be a state-action
pair that has been visited at least $m$ times. Let $\hat{P}(x,a,y)$ and $\hat{R}(x,a,y)$
denote the obtained empirical estimates, and let $\epsilon > 0$ and $\delta > 0$ be arbitrary positive
values. If

$$m > \frac{2 \max\{1, R^0_{\max}\}^2}{\epsilon^2} \ln\frac{2}{\delta},$$

then for all $y \in X$,

$$\left|P(x,a,y) R(x,a,y) - \hat{P}(x,a,y) \hat{R}(x,a,y)\right| \le \epsilon \quad\text{and}\quad \left|P(x,a,y) - \hat{P}(x,a,y)\right| \le \epsilon$$

holds with probability at least $1 - \delta$.


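Proof. Suppose that $(x,a)$ is visited $k$ times at steps $t_1, \ldots, t_k$. Define the
random variables

$$Z_i(y) := \begin{cases} 1, & \text{if } x_{t_i+1} = y; \\ 0, & \text{otherwise.} \end{cases}$$

Clearly, $E[Z_i(y)] = P(x,a,y)$ and $Z_i(y) - P(x,a,y)$ is a martingale, so we can
apply Azuma's lemma with $a = k\epsilon$ to get

$$\Pr\left[\left|\frac{1}{k}\sum_{i=1}^{k} Z_i(y) - P(x,a,y)\right| > \epsilon\right] \le 2\exp\left(-\frac{k\epsilon^2}{2}\right) \le 2\exp\left(-\frac{m\epsilon^2}{2}\right).$$

The right-hand side is less than $\delta$ for

$$m > \frac{2}{\epsilon^2} \ln\frac{2}{\delta}.$$

Similarly, define the random variables

$$W_i(y) := \begin{cases} r_{t_i+1}, & \text{if } x_{t_i+1} = y; \\ 0, & \text{otherwise.} \end{cases}$$

In this case, $E[W_i(y)] = P(x,a,y)R(x,a,y)$, $W_i(y) - P(x,a,y)R(x,a,y)$ is a
martingale and is bounded by $R^0_{\max}$ (note that we are considering only states
$x, y \in X$, that is, the garden-of-Eden state $x_E$ is excluded; therefore, $R^0_{\max}$ is
indeed an upper bound on $R(x,a,y)$), so we can apply Azuma's lemma with
$a = k\epsilon$ to get

$$\Pr\left[\left|\frac{1}{k}\sum_{i=1}^{k} W_i(y) - P(x,a,y)R(x,a,y)\right| > \epsilon\right] \le 2\exp\left(-\frac{k\epsilon^2}{2 (R^0_{\max})^2}\right) \le 2\exp\left(-\frac{m\epsilon^2}{2 (R^0_{\max})^2}\right).$$

The right-hand side is less than $\delta$ for

$$m > \frac{2 (R^0_{\max})^2}{\epsilon^2} \ln\frac{2}{\delta}.$$

Unifying the two requirements for $m$ completes the proof of the lemma.
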
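To get a feeling for the magnitudes involved, the unified requirement on $m$ can be evaluated numerically. The helper below is a sketch under our reading of the bound, $m > 2\max\{1, R^0_{\max}\}^2 \epsilon^{-2} \ln(2/\delta)$; the function name and the example values are illustrative assumptions.

```python
import math


def visits_needed(epsilon, delta, r0_max):
    """Smallest integer m with m > 2 * max(1, r0_max)^2 / epsilon^2 * ln(2 / delta)."""
    bound = 2.0 * max(1.0, r0_max) ** 2 / epsilon ** 2 * math.log(2.0 / delta)
    return math.floor(bound) + 1


# Illustrative values: accuracy 0.1, failure probability 0.05, rewards at most 1.
m_example = visits_needed(epsilon=0.1, delta=0.05, r0_max=1.0)
```

The requirement grows quadratically in $1/\epsilon$ and in the reward bound, but only logarithmically in $1/\delta$.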
The following is a minor modification of [KS] Lemma 4, and [SL] Lemma 1.
The result tells that if the parameters of two MDPs are very close to each other,
then the value functions in the two MDPs will also be similar.

Lemma A.3 Let $\epsilon > 0$, and consider two MDPs $M = (X, A, P, R, \gamma)$ and
$\bar{M} = (X, A, \bar{P}, \bar{R}, \gamma)$ that differ only in their transition and reward functions;
furthermore, their difference is bounded:

$$\left|P(x,a,y)R(x,a,y) - \bar{P}(x,a,y)\bar{R}(x,a,y)\right| \le \epsilon', \qquad \left|P(x,a,y) - \bar{P}(x,a,y)\right| \le \epsilon'$$

for all $(x,a,y) \in X \times A \times X$, where

$$\epsilon' := \frac{(1-\gamma)^2}{|X|\,(1 - \gamma + R^0_{\max})}\,\epsilon.$$

Then for any policy $\pi$ and any $(x,a) \in X \times A$,

$$\left|Q^\pi(x,a) - \bar{Q}^\pi(x,a)\right| \le \epsilon.$$

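Proof. Let $\Delta := \max_{(x,a) \in X \times A} |Q^\pi(x,a) - \bar{Q}^\pi(x,a)|$, and note that for any
$x \in X$,

$$\left|V^\pi(x) - \bar{V}^\pi(x)\right| = \left|\sum_{a} \pi(x,a)\left(Q^\pi(x,a) - \bar{Q}^\pi(x,a)\right)\right| \le \sum_{a} \pi(x,a)\,\Delta = \Delta.$$

For a fixed $(x,a)$ pair attaining the maximum,

$$\begin{aligned}
\Delta &= \left|Q^\pi(x,a) - \bar{Q}^\pi(x,a)\right| \\
&= \left|\sum_{y \in X} P(x,a,y)\left(R(x,a,y) + \gamma V^\pi(y)\right) - \sum_{y \in X} \bar{P}(x,a,y)\left(\bar{R}(x,a,y) + \gamma \bar{V}^\pi(y)\right)\right| \\
&\le \sum_{y \in X} \left|P(x,a,y)R(x,a,y) - \bar{P}(x,a,y)\bar{R}(x,a,y)\right|
 + \left|\sum_{y \in X} \left[P(x,a,y) - \bar{P}(x,a,y)\right] \gamma V^\pi(y)\right| \\
&\qquad + \left|\sum_{y \in X} \bar{P}(x,a,y)\,\gamma\left[V^\pi(y) - \bar{V}^\pi(y)\right]\right| \\
&\le |X|\epsilon' + \sum_{y \in X} \epsilon' \left|\gamma V^\pi(y)\right| + \sum_{y \in X} \bar{P}(x,a,y)\,(\gamma\Delta) \\
&\le |X|\epsilon' + |X|\epsilon' \frac{R^0_{\max}}{1-\gamma} + \gamma\Delta.
\end{aligned}$$

Therefore,

$$\Delta \le \frac{|X|\,\epsilon'\,(1 - \gamma + R^0_{\max})}{(1-\gamma)^2} = \epsilon.$$
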


Let us introduce a modified version of OIM that behaves exactly like the
old one, except that for each $(x,a)$ pair, it performs at most $m$ updates. If a
pair is visited more than $m$ times, the modified algorithm leaves the counters
unchanged.

The following result is a modification of [SL]'s Lemma 7. 

Lemma A. 4 Suppose that the modified OIM (stopping after m updates) is ex- 
ecuted on an MDP M = (X, A, P, R, 7) with 



m := 



/3 := 



2ma^{l,i;^,j2 ^^2 



5' 

V'21n(2|X| \A\m/5). 



1-7 

Then, with probability at least 1 — 6, 

Q*{x,a) - ^Pt{x,a,y) Rt{x,a,y) + -fV*{y) < (3/Vk 
yex 

for all t= 1,2,... 

Proof. Fix a state-action pair $(x,a)$ and suppose that it has been visited $k \le m$ times until time step $t$, at steps $t_1,\ldots,t_k$. Define the random variables $X_1,\ldots,X_k$ by

$$X_i := r_{t_i} + \gamma V^*(x_{t_i+1}).$$

Note that $E[X_i] = Q^*(x,a)$ and $0 \le X_i \le R^0_{\max}/(1-\gamma)$ for all $i = 1,\ldots,k$, and the sequence $Q^*(x,a) - X_i$ is a martingale difference sequence. Applying Azuma's lemma yields

$$\Pr\Big(E[X_i] - \frac{1}{k}\sum_{i=1}^{k} X_i > a/k\Big) \le \exp\Big(-\frac{a^2(1-\gamma)^2}{2k\,(R^0_{\max})^2}\Big) \tag{1}$$

for any $a$. Let the right-hand side be equal to $\frac{\delta}{2|X||A|m}$, corresponding to

$$a = \beta\sqrt{k}$$

with

$$\beta := \frac{R^0_{\max}}{1-\gamma}\sqrt{2\ln(2|X||A|m/\delta)}.$$
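
The choice of $\beta$ can be checked directly: plugging $a = \beta\sqrt{k}$ into the exponential tail should give exactly $\delta/(2|X||A|m)$, independently of $k$. The sketch below is illustrative; the function names and the sample parameter values are assumptions, and the tail is the standard Azuma form for differences bounded by $R^0_{\max}/(1-\gamma)$.

```python
import math


def beta(r0_max, gamma, n_states, n_actions, m, delta):
    """Confidence width constant: R0max/(1-gamma) * sqrt(2 ln(2|X||A|m/delta))."""
    return r0_max / (1.0 - gamma) * math.sqrt(
        2.0 * math.log(2.0 * n_states * n_actions * m / delta))


def azuma_tail(a, k, r0_max, gamma):
    """exp(-a^2 / (2 k c^2)) with c = R0max / (1 - gamma)."""
    c = r0_max / (1.0 - gamma)
    return math.exp(-a * a / (2.0 * k * c * c))


# Hypothetical example values, chosen only for illustration.
n_states, n_actions, m, delta = 10, 4, 500, 0.05
r0_max, gamma = 1.0, 0.9
b = beta(r0_max, gamma, n_states, n_actions, m, delta)
target = delta / (2.0 * n_states * n_actions * m)
tails = [azuma_tail(b * math.sqrt(k), k, r0_max, gamma) for k in (1, 10, 100)]
```

Each entry of `tails` equals `target` up to rounding, which is why the same per-event failure probability can be used for every visit count $k$.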

Note that by the construction of the OIM algorithm,

$$\sum_{y\in X'} \hat P_t(x,a,y)\big[\hat R_t(x,a,y) + \gamma V^*(y)\big] = \sum_{y\in X:\,N_t(x,a,y)>0} \frac{N_t(x,a,y)}{N_t(x,a)}\Big[\frac{C_t(x,a,y)}{N_t(x,a,y)} + \gamma V^*(y)\Big] + \frac{1}{N_t(x,a)}\big[R_{\max} + \gamma V^*(x_E)\big]$$
$$= \frac{k}{k+1}\cdot\frac{1}{k}\sum_{y\in X:\,N_t(x,a,y)>0} N_t(x,a,y)\Big[\frac{C_t(x,a,y)}{N_t(x,a,y)} + \gamma V^*(y)\Big] + \frac{1}{k+1}\,\frac{R_{\max}}{1-\gamma},$$

where $x_E$ is the garden-of-Eden state and we exploited the fact that $k = N_t(x,a) - 1$. Therefore,

$$\frac{1}{k}\sum_{i=1}^{k} X_i = \frac{k+1}{k}\sum_{y\in X'} \hat P_t(x,a,y)\big[\hat R_t(x,a,y) + \gamma V^*(y)\big] - \frac{1}{k}\,\frac{R_{\max}}{1-\gamma}.$$

Substituting this into (1), we get that

$$Q^*(x,a) - \frac{k+1}{k}\sum_{y\in X'} \hat P_t(x,a,y)\big[\hat R_t(x,a,y) + \gamma V^*(y)\big] + \frac{1}{k}\,\frac{R_{\max}}{1-\gamma} \le \beta/\sqrt{k}$$

with high probability, but we will use only the slightly looser inequality

$$Q^*(x,a) - \sum_{y\in X} \hat P_t(x,a,y)\big[\hat R_t(x,a,y) + \gamma V^*(y)\big] \le \beta/\sqrt{k}. \tag{2}$$

For each $(x,a)$, the modified OIM algorithm changes the parameters at most $m$ times, which is at most $m|X||A|$ changes in total. Each different approximation fails with probability less than $\frac{\delta}{2|X||A|m}$, so, by the union bound, the total probability that (2) fails (at any time, for any state-action pair) is still less than $\delta/2$. ■
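
The bookkeeping behind the step from (1) to (2) is a pure counting identity and can be checked numerically. The sketch below is an illustration under stated assumptions: as the text indicates via $k = N_t(x,a) - 1$, the model of a pair is taken to hold one extra pseudo-visit to the garden-of-Eden state, whose value is taken to be $R_{\max}/(1-\gamma)$; the function name, the stand-in values used for $V^*$, and the sample parameters are hypothetical. It compares the empirical mean of the $X_i$ with $\frac{k+1}{k}$ times the model backup minus $\frac{1}{k}\frac{R_{\max}}{1-\gamma}$.

```python
import random


def goe_identity_gap(k=7, n_states=3, gamma=0.9, r0_max=1.0, r_max=50.0, seed=1):
    """Difference between mean(X_i) and its expression via the model backup."""
    rng = random.Random(seed)
    # Arbitrary stand-in for V*; the identity holds for any fixed values.
    v_star = [rng.random() * r0_max / (1.0 - gamma) for _ in range(n_states)]
    n_xy = [0] * n_states      # N_t(x, a, y) for real successor states
    c_xy = [0.0] * n_states    # C_t(x, a, y): summed rewards
    samples = []               # X_i = r + gamma * V*(next state)
    for _ in range(k):
        y = rng.randrange(n_states)
        r = rng.random() * r0_max
        n_xy[y] += 1
        c_xy[y] += r
        samples.append(r + gamma * v_star[y])
    n_total = k + 1            # k real visits plus one garden-of-Eden pseudo-visit
    backup = sum(n_xy[y] / n_total * (c_xy[y] / n_xy[y] + gamma * v_star[y])
                 for y in range(n_states) if n_xy[y] > 0)
    backup += (1.0 / n_total) * r_max / (1.0 - gamma)
    lhs = sum(samples) / k
    rhs = (k + 1) / k * backup - r_max / (k * (1.0 - gamma))
    return lhs - rhs


gaps = [goe_identity_gap(k=k, seed=k) for k in (1, 2, 7, 50)]
```

All entries of `gaps` vanish up to floating-point rounding, confirming that the optimistic pseudo-visit only rescales the empirical mean and adds a term that shrinks like $1/k$.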

The following result shows that the modified OIM algorithm preserves the 
optimism of the value function with high probability. 
Lemma A.5 Let $\epsilon_1 > 0$ and suppose that the modified OIM is executed on an MDP $M = (X, A, P, R, \gamma)$ with

$$R_{\max} \ge \frac{\beta^2}{\epsilon_1},$$

where

$$m := \frac{2\max\{1,(R^0_{\max})^2\}}{\epsilon_1^2}\,\ln\frac{2}{\delta},$$

$$\beta := \frac{R^0_{\max}}{1-\gamma}\sqrt{2\ln(2|X||A|m/\delta)}.$$

Then, with probability at least $1-\delta/2$, $Q^{\mathrm{mOIM}}_{\hat M}(x,a) \ge Q^*(x,a) - \epsilon_1$ for all $t = 1, 2, \ldots$

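Proof. According to the previous lemma,

$$\sum_{y} \hat P_t(x,a,y)\big(\hat R_t(x,a,y) + \gamma V^*(y)\big) - Q^*(x,a) \ge -\beta/\sqrt{N_t(x,a)} \tag{3}$$

with probability $1-\delta/2$.
We will show that

$$\frac{R_{\max}}{N_t(x,a)(1-\gamma)} + (1-\gamma)\epsilon_1 \ge \frac{\beta}{\sqrt{N_t(x,a)}}. \tag{4}$$

For $N_t(x,a) \le \frac{R_{\max}}{(1-\gamma)^2\epsilon_1}$, the first term dominates the l.h.s. and we can omit the second term (and prove the stricter inequality). In the following, we proceed by a series of equivalent transformations:

$$\frac{R_{\max}}{N_t(x,a)(1-\gamma)} \ge \frac{\beta}{\sqrt{N_t(x,a)}},$$
$$\frac{R_{\max}}{\beta(1-\gamma)} \ge \sqrt{N_t(x,a)},$$
$$\frac{R_{\max}^2}{\beta^2(1-\gamma)^2} \ge N_t(x,a),$$

which is implied by the stricter inequality

$$\frac{R_{\max}^2}{\beta^2(1-\gamma)^2} \ge \frac{R_{\max}}{(1-\gamma)^2\epsilon_1},$$
$$R_{\max} \ge \frac{\beta^2}{\epsilon_1},$$

which holds by the assumption of the lemma. If the relation is reversed, then the first term can be omitted, leading to

$$(1-\gamma)\epsilon_1 \ge \frac{\beta}{\sqrt{N_t(x,a)}},$$
$$\frac{\beta}{(1-\gamma)\epsilon_1} \le \sqrt{N_t(x,a)},$$
$$\frac{\beta^2}{(1-\gamma)^2\epsilon_1^2} \le N_t(x,a),$$

which is implied by the stricter inequality

$$\frac{\beta^2}{\epsilon_1} \le R_{\max},$$

similarly to the previous case.

At step $t$, a number of DP updates are carried out. We proceed by induction on the number of DP-updates. Initially, $Q^{(0)}(x,a) \ge Q^*(x,a) - \epsilon_1$, then

$$Q^{(i+1)}(x,a) = \sum_{y} \hat P_t(x,a,y)\big(\hat R_t(x,a,y) + \gamma V^{(i)}(y)\big) + \frac{R_{\max}}{N_t(x,a)(1-\gamma)}$$
$$\ge \sum_{y} \hat P_t(x,a,y)\big(\hat R_t(x,a,y) + \gamma(V^*(y) - \epsilon_1)\big) + \frac{R_{\max}}{N_t(x,a)(1-\gamma)}$$
$$\ge Q^*(x,a) - \beta/\sqrt{N_t(x,a)} - \gamma\epsilon_1 + \frac{R_{\max}}{N_t(x,a)(1-\gamma)}$$
$$\ge Q^*(x,a) - \gamma\epsilon_1 - (1-\gamma)\epsilon_1 = Q^*(x,a) - \epsilon_1,$$

where we applied (3), (4) and the induction assumption. ■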
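
Inequality (4) is the heart of the optimism argument: the bonus from the optimistic initial model plus the slack $(1-\gamma)\epsilon_1$ must cover the confidence width $\beta/\sqrt{N_t(x,a)}$ for every visit count. The following sketch scans a range of visit counts and reports the smallest margin. The function names and the sample values of $\beta$, $\epsilon_1$ and $\gamma$ are assumptions chosen so that the crossover between the two cases falls inside the scanned range; $R_{\max}$ is set to the smallest value allowed by the case analysis, $\beta^2/\epsilon_1$.

```python
import math


def optimism_margin(n, beta, eps1, gamma, r_max):
    """Left-hand side minus right-hand side of inequality (4) at visit count n."""
    bonus = r_max / (n * (1.0 - gamma))
    slack = (1.0 - gamma) * eps1
    return bonus + slack - beta / math.sqrt(n)


def min_margin(beta, eps1, gamma, n_max=10000):
    """Smallest margin over n = 1..n_max when R_max = beta^2 / eps1."""
    r_max = beta ** 2 / eps1
    return min(optimism_margin(n, beta, eps1, gamma, r_max)
               for n in range(1, n_max + 1))


# Hypothetical values: here the two cases of the proof meet at n = 16.
worst = min_margin(beta=1.0, eps1=0.5, gamma=0.5)
```

The margin stays positive over the whole range: for small counts the bonus term alone suffices, for large counts the slack term alone suffices, which mirrors the two cases of the proof.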
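Define the H-step truncated value function of policy $\pi$ as

$$Q^\pi(x,a,H) := E\Big(\sum_{t=0}^{H-1} \gamma^t r_{t+1} \,\Big|\, x = x_0, a = a_0\Big).$$

Lemma A.6 ([KS] Lemma 2) Let $\epsilon > 0$ and consider an MDP $M = (X, A, P, R, \gamma)$. If

$$H \ge \frac{1}{1-\gamma}\log\frac{R^0_{\max}}{\epsilon(1-\gamma)}$$

then

$$Q^\pi(x,a,H) \le Q^\pi(x,a) \le Q^\pi(x,a,H) + \epsilon$$

for any $(x,a) \in X \times A$.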

Proof. Let $\Xi(x,a)$ denote the set of infinite trajectories starting in $(x,a)$, and for any trajectory $\xi \in \Xi(x,a)$, let $\xi_H$ denote its H-step truncation. Furthermore, denote the discounted total reward along a trajectory $\xi$ by $v(\xi)$. Clearly,

$$Q^\pi(x,a) = E_\xi[v(\xi)] = \sum_{\xi\in\Xi(x,a)} \Pr(\xi)\,v(\xi) \quad\text{and}\quad Q^\pi(x,a,H) = E_\xi[v(\xi_H)] = \sum_{\xi\in\Xi(x,a)} \Pr(\xi)\,v(\xi_H).$$

Fix a trajectory $\xi$, along which the agent receives rewards $r_1, r_2, \ldots$, for which

$$v(\xi_H) = \sum_{t=0}^{H-1} \gamma^t r_{t+1} \quad\text{and}\quad v(\xi) = \sum_{t=0}^{\infty} \gamma^t r_{t+1} = v(\xi_H) + \sum_{t=H}^{\infty} \gamma^t r_{t+1}.$$

It is trivial that $v(\xi) \ge v(\xi_H)$, as the additional terms are all nonnegative by assumption. On the other hand,

$$\sum_{t=H}^{\infty} \gamma^t r_{t+1} \le \sum_{t=H}^{\infty} \gamma^t R^0_{\max} = \frac{\gamma^H R^0_{\max}}{1-\gamma},$$

which is smaller than $\epsilon$ if $H \ge \log\frac{\epsilon(1-\gamma)}{R^0_{\max}} / \log\gamma$ (which follows from the assumption of the lemma and the inequality $-\log\gamma \ge 1-\gamma$), that is,

$$v(\xi) \le v(\xi_H) + \epsilon.$$

As the relations hold for each trajectory in $\Xi(x,a)$, they hold for the expected value, too. ■
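
The horizon condition is easy to evaluate in practice. The sketch below, with hypothetical function names and example values, computes the smallest integer horizon satisfying the lemma's sufficient condition and checks that the discounted tail $\gamma^H R^0_{\max}/(1-\gamma)$ is indeed at most $\epsilon$.

```python
import math


def truncation_horizon(eps, gamma, r0_max):
    """Smallest integer H with H >= log(R0max / (eps (1 - gamma))) / (1 - gamma)."""
    h = math.log(r0_max / (eps * (1.0 - gamma))) / (1.0 - gamma)
    return max(0, math.ceil(h))


def tail_bound(h, gamma, r0_max):
    """Upper bound on the reward collected after step h: gamma^h R0max / (1 - gamma)."""
    return gamma ** h * r0_max / (1.0 - gamma)


# Hypothetical example values.
eps, r0_max = 0.1, 1.0
horizons = {g: truncation_horizon(eps, g, r0_max) for g in (0.5, 0.9, 0.99)}
tails = {g: tail_bound(h, g, r0_max) for g, h in horizons.items()}
```

The horizon grows roughly like $1/(1-\gamma)$ up to a logarithmic factor, so discount factors close to one make the truncated analysis considerably longer.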
The following lemma tells that OIM and its modified version learn almost 
the same values with high probability. 

Lemma A.7 For any $\epsilon > 0$, $\delta > 0$, if

$$\epsilon' := \frac{(1-\gamma)^2}{|X|\,(1-\gamma+R^0_{\max})}\,\epsilon \quad\text{and}\quad m \ge \frac{2\max\{1,(R^0_{\max})^2\}}{\epsilon'^2}\,\ln\frac{2}{\delta},$$

then for any MDP $M$ and any $(x,a) \in X \times A$,

$$|Q^{\mathrm{OIM}}_{\hat M}(x,a) - Q^{\mathrm{mOIM}}_{\hat M}(x,a)| \le 2\epsilon$$

with probability at least $1-2\delta$.

Proof. The model estimates of the two algorithm variants are identical on not-yet-known pairs, where the visit count is less than $m$. On known pairs, we can apply Lemma A.2 to both model estimates to see that they are $\epsilon'$-close to the true model parameters with probability at least $1-\delta$. Consequently, they are $2\epsilon'$-close to each other with probability at least $1-2\delta$. Applying Lemma A.3 proves the statement of the lemma. ■

Lemma A.8 (Lemma 3 of [SL]) Let $M = (X, A, P, R, \gamma)$ be an MDP, $K$ a set of state-action pairs, $\bar M$ an MDP equal to $M$ on $K$ (identical transition and reward functions), $\pi$ a policy, and $H$ some positive integer. Let $A_M$ be the event that a state-action pair not in $K$ is encountered in a trial generated by starting from $(x,a)$ and following $\pi$ for $H$ steps in $M$. Then,

$$Q^\pi_M(x,a,H) \ge Q^\pi_{\bar M}(x,a,H) - \frac{R^0_{\max}}{1-\gamma}\Pr(A_M). \tag{5}$$

Proof. Let $\Xi$ be the set of H-step long trajectories, and let $\Xi_K \subset \Xi$ be the set of trajectories for which all occurring $(x_t, a_t)$ pairs are in $K$. For any $\xi \in \Xi$, let $\Pr_M(\xi)$ denote the probability of that trajectory happening in MDP $M$.

Let $v(\xi)$ be the discounted total reward received by the agent along the H-step trajectory $\xi \in \Xi$. Now, we have the following:

$$Q^\pi_{\bar M}(x,a,H) = \sum_{\xi\in\Xi} \Pr_{\bar M}(\xi)\,v(\xi)$$
$$\le \sum_{\xi\in\Xi_K} \Pr_{\bar M}(\xi)\,v(\xi) + \sum_{\xi\notin\Xi_K} \Pr_{\bar M}(\xi)\,\frac{R^0_{\max}}{1-\gamma}$$
$$\le \sum_{\xi\in\Xi_K} \Pr_{\bar M}(\xi)\,v(\xi) + \Pr(A_M)\,\frac{R^0_{\max}}{1-\gamma}$$
$$= \sum_{\xi\in\Xi_K} \Pr_{M}(\xi)\,v(\xi) + \Pr(A_M)\,\frac{R^0_{\max}}{1-\gamma}$$
$$\le Q^\pi_M(x,a,H) + \Pr(A_M)\,\frac{R^0_{\max}}{1-\gamma}.$$

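Theorem A.9 For any $\epsilon > 0$, $\delta > 0$, let

$$\epsilon_1 := \epsilon/6,$$
$$\epsilon_2 := \frac{(1-\gamma)^2}{|X|\,(1-\gamma+R^0_{\max})}\,\epsilon_1,$$
$$H := \frac{1}{1-\gamma}\ln\frac{R^0_{\max}}{\epsilon_1(1-\gamma)},$$
$$m := \frac{2\max\{1,(R^0_{\max})^2\}}{\epsilon_2^2}\,\ln\frac{8}{\delta}.$$

OIM converges almost surely to a near-optimal policy in polynomial time if started with

$$R_{\max} = \frac{2(R^0_{\max})^2\ln(2|X||A|m/\delta)}{\epsilon_1(1-\gamma)^2},$$

that is, with probability $1-\delta$, the number of timesteps where $Q^{\mathrm{OIM}}_M(x_t,a_t) > Q^*(x_t,a_t) - \epsilon$ does not hold, is at most

$$\frac{2m|X||A|H R^0_{\max}}{\epsilon_1(1-\gamma)}\,\ln\frac{4}{\delta}.$$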
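Remark A.10 When expressed in terms of MDP parameters, the time requirement is

$$\frac{864\,|X|^3|A|\,R^0_{\max}\max\{1,(R^0_{\max})^2\}(1-\gamma+R^0_{\max})^2}{\epsilon^3(1-\gamma)^4}\,\ln\frac{6R^0_{\max}}{\epsilon(1-\gamma)}\,\ln\frac{4}{\delta}\,\ln\frac{8}{\delta}$$
$$= O\Big(\frac{|X|^3|A|\,R^0_{\max}\max\{1,(R^0_{\max})^2\}(1-\gamma+R^0_{\max})^2}{\epsilon^3(1-\gamma)^4}\,\ln\frac{R^0_{\max}}{\epsilon(1-\gamma)}\,\ln^2\frac{1}{\delta}\Big)$$

and the required initialization value is

$$R_{\max} = \frac{12(R^0_{\max})^2}{\epsilon(1-\gamma)^2}\,\ln\Big(\frac{144\,|X|^3|A|\max\{1,(R^0_{\max})^2\}(1-\gamma+R^0_{\max})^2}{\delta\,\epsilon^2(1-\gamma)^4}\,\ln\frac{8}{\delta}\Big)$$
$$= O\Big(\frac{(R^0_{\max})^2}{\epsilon(1-\gamma)^2}\,\ln\frac{|X||A|R^0_{\max}}{\epsilon(1-\gamma)\delta}\Big).$$

Proof.

Let $M$ denote the true (and unknown) MDP, let $\hat M$ be the approximate model of OIM.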

An $(x,a)$ pair is considered known if it has been visited at least $m$ times. According to Lemma A.2, for a known pair $(x,a)$, the model estimates $\hat P(x,a,\cdot)$ and $\hat R(x,a,\cdot)$ are $\epsilon_2$-close to the true values with probability at least $1-\delta/4$.

Define the MDP $\bar M$ so that it is identical to $M$ for known pairs, and equals $\hat M$ for unknown pairs. The parameters of $\hat M$ and $\bar M$ are identical on unknown pairs and $\epsilon_2$-close for known pairs (with probability $1-\delta/4$), so, by Lemma A.3,

$$|Q^\pi_{\bar M}(x,a) - Q^\pi_{\hat M}(x,a)| \le \epsilon_1 \tag{6}$$

for any policy $\pi$ and any $(x,a) \in X \times A$.
Let

$$H := \frac{1}{1-\gamma}\ln\frac{R^0_{\max}}{\epsilon_1(1-\gamma)}.$$

By Lemma A.6,

$$|Q^\pi_{\bar M}(x,a,H) - Q^\pi_{\bar M}(x,a)| \le \epsilon_1 \tag{7}$$

holds for the H-step truncated value function for any $(x,a)$, $\pi$.

Consider a state-action pair $(x_1,a_1)$ and an H-step long trajectory generated by $\pi$. Let $K$ be the set of known $(x,a)$ pairs and let $A_M$ be the event that an unknown pair is encountered along the trajectory. Then, by Lemma A.8,

$$Q^\pi_M(x_1,a_1,H) \ge Q^\pi_{\bar M}(x_1,a_1,H) - \frac{R^0_{\max}}{1-\gamma}\Pr(A_M). \tag{8}$$

By applying Lemma A.7 to $\epsilon_1$, $\delta/4$, we get that the above setting of $R_{\max}$ ensures that the original and the modified version of OIM behave similarly:

$$|Q^{\mathrm{OIM}}_{\hat M}(x,a) - Q^{\mathrm{mOIM}}_{\hat M}(x,a)| \le 2\epsilon_1 \tag{9}$$

with probability at least $1-\delta/2$. Furthermore, by Lemma A.5 (with $\epsilon \leftarrow \epsilon_1$ and $\delta \leftarrow \delta/4$), the modified algorithm preserves the optimism of the value function with probability at least $1-\delta/4$:

$$Q^{\mathrm{mOIM}}_{\hat M}(x,a) \ge Q^*(x,a) - \epsilon_1.$$

To conclude the proof, we separate two cases (following the line of thought of Theorem 1 in [SL]). In the first case, an exploration step will occur with high probability: Suppose that $\Pr(A_M) > \epsilon_1(1-\gamma)/R^0_{\max}$, that is, an unknown pair is visited in $H$ steps with high probability. This can happen at most $m|X||A|$ times, so by Azuma's bound, with probability $1-\delta/4$, all $(x,a)$ will become known after $\frac{2m|X||A|H R^0_{\max}}{\epsilon_1(1-\gamma)}\ln\frac{4}{\delta}$ exploration steps.

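On the other hand, if $\Pr(A_M) \le \epsilon_1(1-\gamma)/R^0_{\max}$, then the policy is near-optimal with probability $1-\delta$:

$$Q^{\mathrm{OIM}}_M(x_1,a_1) \ge Q^{\mathrm{OIM}}_M(x_1,a_1,H)$$
$$\ge Q^{\mathrm{OIM}}_{\bar M}(x_1,a_1,H) - \frac{R^0_{\max}}{1-\gamma}\Pr(A_M)$$
$$\ge Q^{\mathrm{OIM}}_{\bar M}(x_1,a_1,H) - \epsilon_1$$
$$\ge Q^{\mathrm{OIM}}_{\bar M}(x_1,a_1) - 2\epsilon_1$$
$$\ge Q^{\mathrm{OIM}}_{\hat M}(x_1,a_1) - 3\epsilon_1$$
$$\ge Q^{\mathrm{mOIM}}_{\hat M}(x_1,a_1) - 5\epsilon_1$$
$$\ge Q^*(x_1,a_1) - 6\epsilon_1 = Q^*(x_1,a_1) - \epsilon,$$

where we applied (in this order) the property that truncation decreases the value function; Eq. (8); our assumption; Eq. (7); Eq. (6); Eq. (9); Lemma A.5 and the definition of $\epsilon_1$.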



B A dimension-respecting version of the proof 

For the proof, we shall follow the technique of ? (?) and ? (?), and will use 
the shorthands [KS] and [SL] for referring to them. We will proceed by a series 
of lemmas. 

Lemma B.1 Consider an MDP $M = (X, A, P, R, \gamma)$, and let $(x,a)$ be a state-action pair that has been visited at least $m$ times. Let $\hat P(x,a,y)$ and $\hat R(x,a,y)$ denote the obtained empirical estimates, let $\epsilon > 0$ and $\delta > 0$ be arbitrary positive values. If

$$m \ge \frac{2}{\epsilon^2}\ln\frac{2}{\delta},$$

then for all $y \in X$,

$$|\hat P(x,a,y)\hat R(x,a,y) - P(x,a,y)R(x,a,y)| \le \epsilon R^0_{\max} \quad\text{and}$$
$$|\hat P(x,a,y) - P(x,a,y)| \le \epsilon$$

hold with probability at least $1-\delta$.

Proof. The second statement is already proven, so let us consider the first one. Suppose that $(x,a)$ is visited $k$ times at steps $t_1,\ldots,t_k$. Define the random variables

$$W_i(y) := \begin{cases} r_{t_i+1}, & \text{if } x_{t_i+1} = y;\\ 0, & \text{otherwise.}\end{cases}$$

In this case, $E[W_i(y)] = P(x,a,y)R(x,a,y)$, and $W_i(y) - P(x,a,y)R(x,a,y)$ is a martingale difference sequence bounded by $R^0_{\max}$ (note that we are considering only states $x, y \in X$, that is, the garden-of-Eden state is excluded; therefore, $R^0_{\max}$ is indeed an upper bound on $R(x,a,y)$), so we can apply Azuma's lemma with $a = k\epsilon R^0_{\max}$ to get

$$\Pr\Big(\Big|\frac{1}{k}\sum_{i=1}^{k} W_i(y) - P(x,a,y)R(x,a,y)\Big| > \epsilon R^0_{\max}\Big) \le 2\exp\Big(-\frac{k\epsilon^2}{2}\Big).$$

The right-hand side is less than $\delta$ for

$$m \ge \frac{2}{\epsilon^2}\ln\frac{2}{\delta}.$$
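
The visit threshold can be computed directly from the tail bound $2\exp(-m\epsilon^2/2) \le \delta$. The short sketch below uses hypothetical function names and example values; it returns the smallest integer $m$ meeting the threshold and confirms that one visit fewer would not suffice for this bound.

```python
import math


def visits_needed(eps, delta):
    """Smallest integer m with m >= (2 / eps^2) * ln(2 / delta)."""
    return math.ceil(2.0 / eps ** 2 * math.log(2.0 / delta))


def failure_bound(m, eps):
    """Two-sided tail bound 2 exp(-m eps^2 / 2) after m visits."""
    return 2.0 * math.exp(-m * eps ** 2 / 2.0)


m = visits_needed(eps=0.1, delta=0.05)
```

Note that in this dimension-respecting form the threshold does not depend on the reward scale: $\epsilon$ is a relative accuracy, and $R^0_{\max}$ enters only through the guarantee $\epsilon R^0_{\max}$.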



The following is a minor modification of [KS] Lemma 4 and [SL] Lemma 1.
The result states that if the parameters of two MDPs are very close to each other,
then the value functions in the two MDPs will also be similar.

Lemma B.2 Let $\epsilon > 0$, and consider two MDPs $M = (X, A, P, R, \gamma)$ and $\bar M = (X, A, \bar P, \bar R, \gamma)$ that differ only in their transition and reward functions; furthermore, their difference is bounded:

$$|\bar P(x,a,y)\bar R(x,a,y) - P(x,a,y)R(x,a,y)| \le \epsilon' R^0_{\max} \quad\text{and}$$
$$|\bar P(x,a,y) - P(x,a,y)| \le \epsilon'$$

for all $(x,a,y) \in X \times A \times X$ and

$$\epsilon' := \frac{(1-\gamma)^2}{|X|}\,\epsilon.$$

Then for any policy $\pi$ and any $(x,a) \in X \times A$,

$$|Q^\pi(x,a) - \bar Q^\pi(x,a)| \le \epsilon R^0_{\max}.$$
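Proof. Let $\Delta := \max_{(x,a)\in X\times A} |Q^\pi(x,a) - \bar Q^\pi(x,a)|$, and note that for any $x \in X$,

$$|V^\pi(x) - \bar V^\pi(x)| = \Big|\sum_a \pi(x,a)\big(Q^\pi(x,a) - \bar Q^\pi(x,a)\big)\Big| \le \sum_a \pi(x,a)\,\Delta = \Delta.$$

For a fixed $(x,a)$ pair,

$$|Q^\pi(x,a) - \bar Q^\pi(x,a)| = \Big|\sum_{y\in X} P(x,a,y)\big(R(x,a,y) + \gamma V^\pi(y)\big) - \sum_{y\in X} \bar P(x,a,y)\big(\bar R(x,a,y) + \gamma \bar V^\pi(y)\big)\Big|$$
$$\le \sum_{y\in X} |P(x,a,y)R(x,a,y) - \bar P(x,a,y)\bar R(x,a,y)| + \Big|\sum_{y\in X} [P(x,a,y) - \bar P(x,a,y)]\,\gamma V^\pi(y)\Big| + \Big|\sum_{y\in X} \bar P(x,a,y)\,\gamma[V^\pi(y) - \bar V^\pi(y)]\Big|$$
$$\le |X|\epsilon' R^0_{\max} + \sum_{y\in X} \epsilon'\,|\gamma V^\pi(y)| + \sum_{y\in X} \bar P(x,a,y)\,\gamma\Delta$$
$$\le |X|\epsilon' R^0_{\max} + |X|\epsilon'\,\frac{\gamma R^0_{\max}}{1-\gamma} + \gamma\Delta.$$

The bound holds for every pair, in particular for the one attaining the maximum. Therefore,

$$\Delta \le \frac{|X|\,\epsilon' R^0_{\max}}{(1-\gamma)^2} = \epsilon R^0_{\max}.$$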



Let us introduce a modified version of OIM that behaves exactly like the
old one, except that for each (x, a) pair, it performs at most m updates. If a
pair is visited more than m times, the modified algorithm leaves the counters
unchanged.

The following result is a modification of [SL]'s Lemma 7. 

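Lemma B.3 Suppose that the modified OIM (stopping after $m$ updates) is executed on an MDP $M = (X, A, P, R, \gamma)$ with

$$m := \frac{2}{\epsilon^2}\ln\frac{2}{\delta},$$

$$\beta := \frac{R^0_{\max}}{1-\gamma}\sqrt{2\ln(2|X||A|m/\delta)}.$$

Then, with probability at least $1-\delta$,

$$Q^*(x,a) - \sum_{y\in X} \hat P_t(x,a,y)\big[\hat R_t(x,a,y) + \gamma V^*(y)\big] \le \beta/\sqrt{k}$$

for all $t = 1, 2, \ldots$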

Proof. The proof is identical to the proof of Lemma A.4. ■
The following result shows that the modified OIM algorithm preserves the 
optimism of the value function with high probability. 

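Lemma B.4 Let $\epsilon_1 > 0$ and suppose that the modified OIM is executed on an MDP $M = (X, A, P, R, \gamma)$ with

$$R_{\max} \ge \frac{\beta^2}{\epsilon_1 R^0_{\max}},$$

where

$$m := \frac{2}{\epsilon_1^2}\ln\frac{2}{\delta},$$

$$\beta := \frac{R^0_{\max}}{1-\gamma}\sqrt{2\ln(2|X||A|m/\delta)}.$$

Then, with probability at least $1-\delta/2$, $Q^{\mathrm{mOIM}}_{\hat M}(x,a) \ge Q^*(x,a) - \epsilon_1 R^0_{\max}$ for all $t = 1, 2, \ldots$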

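Proof. According to the previous lemma,

$$\sum_{y} \hat P_t(x,a,y)\big(\hat R_t(x,a,y) + \gamma V^*(y)\big) - Q^*(x,a) \ge -\beta/\sqrt{N_t(x,a)} \tag{10}$$

with probability $1-\delta/2$.
We will show that

$$\frac{R_{\max}}{N_t(x,a)(1-\gamma)} + (1-\gamma)\epsilon_1 R^0_{\max} \ge \frac{\beta}{\sqrt{N_t(x,a)}}. \tag{11}$$

For $N_t(x,a) \le \frac{R_{\max}}{(1-\gamma)^2\epsilon_1 R^0_{\max}}$, the first term dominates the l.h.s. and we can omit the second term (and prove the stricter inequality). In the following, we proceed by a series of equivalent transformations:

$$\frac{R_{\max}}{N_t(x,a)(1-\gamma)} \ge \frac{\beta}{\sqrt{N_t(x,a)}},$$
$$\frac{R_{\max}}{\beta(1-\gamma)} \ge \sqrt{N_t(x,a)},$$
$$\frac{R_{\max}^2}{\beta^2(1-\gamma)^2} \ge N_t(x,a),$$

which is implied by the stricter inequality

$$\frac{R_{\max}^2}{\beta^2(1-\gamma)^2} \ge \frac{R_{\max}}{R^0_{\max}(1-\gamma)^2\epsilon_1},$$

which holds by the assumption of the lemma. If the relation is reversed, then the first term can be omitted, leading to

$$(1-\gamma)\epsilon_1 R^0_{\max} \ge \frac{\beta}{\sqrt{N_t(x,a)}},$$
$$\frac{\beta}{(1-\gamma)\epsilon_1 R^0_{\max}} \le \sqrt{N_t(x,a)},$$
$$\frac{\beta^2}{(1-\gamma)^2\epsilon_1^2 (R^0_{\max})^2} \le N_t(x,a),$$

which is implied by the stricter inequality

$$\frac{\beta^2}{\epsilon_1 R^0_{\max}} \le R_{\max},$$

similarly to the previous case.

At step $t$, a number of DP updates are carried out. We proceed by induction on the number of DP-updates. Initially, $Q^{(0)}(x,a) \ge Q^*(x,a) - \epsilon_1 R^0_{\max}$, then

$$Q^{(i+1)}(x,a) = \sum_{y} \hat P_t(x,a,y)\big(\hat R_t(x,a,y) + \gamma V^{(i)}(y)\big) + \frac{R_{\max}}{N_t(x,a)(1-\gamma)}$$
$$\ge \sum_{y} \hat P_t(x,a,y)\big(\hat R_t(x,a,y) + \gamma(V^*(y) - \epsilon_1 R^0_{\max})\big) + \frac{R_{\max}}{N_t(x,a)(1-\gamma)}$$
$$\ge Q^*(x,a) - \beta/\sqrt{N_t(x,a)} - \gamma\epsilon_1 R^0_{\max} + \frac{R_{\max}}{N_t(x,a)(1-\gamma)}$$
$$\ge Q^*(x,a) - \gamma\epsilon_1 R^0_{\max} - (1-\gamma)\epsilon_1 R^0_{\max} = Q^*(x,a) - \epsilon_1 R^0_{\max},$$

where we applied (10), (11) and the induction assumption. ■

Define the H-step truncated value function of policy $\pi$ as

$$Q^\pi(x,a,H) := E\Big(\sum_{t=0}^{H-1} \gamma^t r_{t+1} \,\Big|\, x = x_0, a = a_0\Big).$$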

Lemma B.5 Let $\epsilon > 0$ and consider an MDP $M = (X, A, P, R, \gamma)$. If

$$H \ge \frac{1}{1-\gamma}\log\frac{1}{\epsilon(1-\gamma)}$$

then

$$Q^\pi(x,a,H) \le Q^\pi(x,a) \le Q^\pi(x,a,H) + \epsilon R^0_{\max}$$

for any $(x,a) \in X \times A$.

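Proof. Let $\Xi(x,a)$ denote the set of infinite trajectories starting in $(x,a)$, and for any trajectory $\xi \in \Xi(x,a)$, let $\xi_H$ denote its H-step truncation. Furthermore, denote the discounted total reward along a trajectory $\xi$ by $v(\xi)$. Clearly,

$$Q^\pi(x,a) = E_\xi[v(\xi)] = \sum_{\xi\in\Xi(x,a)} \Pr(\xi)\,v(\xi) \quad\text{and}\quad Q^\pi(x,a,H) = E_\xi[v(\xi_H)] = \sum_{\xi\in\Xi(x,a)} \Pr(\xi)\,v(\xi_H).$$

Fix a trajectory $\xi$, along which the agent receives rewards $r_1, r_2, \ldots$, for which

$$v(\xi_H) = \sum_{t=0}^{H-1} \gamma^t r_{t+1} \quad\text{and}\quad v(\xi) = \sum_{t=0}^{\infty} \gamma^t r_{t+1} = v(\xi_H) + \sum_{t=H}^{\infty} \gamma^t r_{t+1}.$$

It is trivial that $v(\xi) \ge v(\xi_H)$, as the additional terms are all nonnegative by assumption. On the other hand,

$$\sum_{t=H}^{\infty} \gamma^t r_{t+1} \le \sum_{t=H}^{\infty} \gamma^t R^0_{\max} = \frac{\gamma^H R^0_{\max}}{1-\gamma},$$

which is smaller than $\epsilon R^0_{\max}$ if $H \ge \log(\epsilon(1-\gamma)) / \log\gamma$ (which follows from the assumption of the lemma and the inequality $-\log\gamma \ge 1-\gamma$), that is,

$$v(\xi) \le v(\xi_H) + \epsilon R^0_{\max}.$$

As the relations hold for each trajectory in $\Xi(x,a)$, they hold for the expected value, too. ■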
The following lemma tells that OIM and its modified version learn almost 
the same values with high probability. 

Lemma B.6 For any $\epsilon > 0$, $\delta > 0$, if

$$\epsilon' := \frac{(1-\gamma)^2}{|X|}\,\epsilon \quad\text{and}\quad m \ge \frac{2}{\epsilon'^2}\ln\frac{2}{\delta},$$

then for any MDP $M$ and any $(x,a) \in X \times A$,

$$|Q^{\mathrm{OIM}}_{\hat M}(x,a) - Q^{\mathrm{mOIM}}_{\hat M}(x,a)| \le 2\epsilon R^0_{\max}$$

with probability at least $1-2\delta$.

Proof. The model estimates of the two algorithm variants are identical on not-yet-known pairs, where the visit count is less than $m$. On known pairs, we can apply Lemma B.1 to both model estimates to see that they are $\epsilon'$-close to the true model parameters with probability at least $1-\delta$. Consequently, they are $2\epsilon'$-close to each other with probability at least $1-2\delta$. Applying Lemma B.2 proves the statement of the lemma. ■
Lemma B.7 Let $M = (X, A, P, R, \gamma)$ be an MDP, $K$ a set of state-action pairs, $\bar M$ an MDP equal to $M$ on $K$ (identical transition and reward functions), $\pi$ a policy, and $H$ some positive integer. Let $A_M$ be the event that a state-action pair not in $K$ is encountered in a trial generated by starting from $(x,a)$ and following $\pi$ for $H$ steps in $M$. Then,

$$Q^\pi_M(x,a,H) \ge Q^\pi_{\bar M}(x,a,H) - \frac{R^0_{\max}}{1-\gamma}\Pr(A_M). \tag{12}$$

Proof. The lemma is identical to Lemma A.8. ■
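Theorem B.8 For any $\epsilon > 0$, $\delta > 0$, let

$$\epsilon_1 := \epsilon/6,$$
$$\epsilon_2 := \frac{(1-\gamma)^2}{|X|}\,\epsilon_1,$$
$$H := \frac{1}{1-\gamma}\ln\frac{1}{\epsilon_1(1-\gamma)},$$
$$m := \frac{2}{\epsilon_2^2}\ln\frac{8}{\delta}.$$

OIM converges almost surely to a near-optimal policy in polynomial time if started with

$$R_{\max} = \frac{2R^0_{\max}\ln(2|X||A|m/\delta)}{\epsilon_1(1-\gamma)^2},$$

that is, with probability $1-\delta$, the number of timesteps where $Q^{\mathrm{OIM}}_M(x_t,a_t) > Q^*(x_t,a_t) - \epsilon R^0_{\max}$ does not hold, is at most

$$\frac{2m|X||A|H}{\epsilon_1(1-\gamma)}\,\ln\frac{4}{\delta}.$$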

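Remark B.9 When expressed in terms of MDP parameters, the time requirement is

$$\frac{864\,|X|^3|A|}{\epsilon^3(1-\gamma)^4}\,\ln\frac{6}{\epsilon(1-\gamma)}\,\ln\frac{4}{\delta}\,\ln\frac{8}{\delta} = O\Big(\frac{|X|^3|A|}{\epsilon^3(1-\gamma)^4}\,\ln\frac{1}{\epsilon(1-\gamma)}\,\ln^2\frac{1}{\delta}\Big)$$

and the required initialization value is

$$R_{\max} = \frac{12R^0_{\max}}{\epsilon(1-\gamma)^2}\,\ln\Big(\frac{144\,|X|^3|A|}{\delta\,\epsilon^2(1-\gamma)^4}\,\ln\frac{8}{\delta}\Big) = O\Big(\frac{R^0_{\max}}{\epsilon(1-\gamma)^2}\,\ln\frac{|X||A|}{\epsilon(1-\gamma)\delta}\Big).$$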
Proof.

Let $M$ denote the true (and unknown) MDP, let $\hat M$ be the approximate model of OIM.

An $(x,a)$ pair is considered known if it has been visited at least $m$ times. According to Lemma B.1, for a known pair $(x,a)$, the model estimates $\hat P(x,a,\cdot)$ and $\hat P(x,a,\cdot)\hat R(x,a,\cdot)$ are $\epsilon_2$-close and $\epsilon_2 R^0_{\max}$-close to the true values, respectively, with probability at least $1-\delta/4$.

Define the MDP $\bar M$ so that it is identical to $M$ for known pairs, and equals $\hat M$ for unknown pairs. The parameters of $\hat M$ and $\bar M$ are identical on unknown pairs and $\epsilon_2$-close for known pairs (with probability $1-\delta/4$), so, by Lemma B.2,

$$|Q^\pi_{\bar M}(x,a) - Q^\pi_{\hat M}(x,a)| \le \epsilon_1 R^0_{\max} \tag{13}$$

for any policy $\pi$ and any $(x,a) \in X \times A$.
Let

$$H := \frac{1}{1-\gamma}\ln\frac{1}{\epsilon_1(1-\gamma)}.$$

By Lemma B.5,

$$|Q^\pi_{\bar M}(x,a,H) - Q^\pi_{\bar M}(x,a)| \le \epsilon_1 R^0_{\max} \tag{14}$$

holds for the H-step truncated value function for any $(x,a)$, $\pi$.

Consider a state-action pair $(x_1,a_1)$ and an H-step long trajectory generated by $\pi$. Let $K$ be the set of known $(x,a)$ pairs and let $A_M$ be the event that an unknown pair is encountered along the trajectory. Then, by Lemma B.7,

$$Q^\pi_M(x_1,a_1,H) \ge Q^\pi_{\bar M}(x_1,a_1,H) - \frac{R^0_{\max}}{1-\gamma}\Pr(A_M). \tag{15}$$

By applying Lemma B.6 to $\epsilon_1$, $\delta/4$, we get that the above setting of $R_{\max}$ ensures that the original and the modified version of OIM behave similarly:

$$|Q^{\mathrm{OIM}}_{\hat M}(x,a) - Q^{\mathrm{mOIM}}_{\hat M}(x,a)| \le 2\epsilon_1 R^0_{\max} \tag{16}$$

with probability at least $1-\delta/2$. Furthermore, by Lemma B.4 (with $\epsilon \leftarrow \epsilon_1$ and $\delta \leftarrow \delta/4$), the modified algorithm preserves the optimism of the value function with probability at least $1-\delta/4$:

$$Q^{\mathrm{mOIM}}_{\hat M}(x,a) \ge Q^*(x,a) - \epsilon_1 R^0_{\max}.$$

To conclude the proof, we separate two cases (following the line of thought of Theorem 1 in [SL]). In the first case, an exploration step will occur with high probability: Suppose that $\Pr(A_M) > \epsilon_1(1-\gamma)$, that is, an unknown pair is visited in $H$ steps with high probability. This can happen at most $m|X||A|$ times, so by Azuma's bound, with probability $1-\delta/4$, all $(x,a)$ will become known after $\frac{2m|X||A|H}{\epsilon_1(1-\gamma)}\ln\frac{4}{\delta}$ exploration steps.
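On the other hand, if $\Pr(A_M) \le \epsilon_1(1-\gamma)$, then the policy is near-optimal with probability $1-\delta$:

$$Q^{\mathrm{OIM}}_M(x_1,a_1) \ge Q^{\mathrm{OIM}}_M(x_1,a_1,H)$$
$$\ge Q^{\mathrm{OIM}}_{\bar M}(x_1,a_1,H) - \frac{R^0_{\max}}{1-\gamma}\Pr(A_M)$$
$$\ge Q^{\mathrm{OIM}}_{\bar M}(x_1,a_1,H) - \epsilon_1 R^0_{\max}$$
$$\ge Q^{\mathrm{OIM}}_{\bar M}(x_1,a_1) - 2\epsilon_1 R^0_{\max}$$
$$\ge Q^{\mathrm{OIM}}_{\hat M}(x_1,a_1) - 3\epsilon_1 R^0_{\max}$$
$$\ge Q^{\mathrm{mOIM}}_{\hat M}(x_1,a_1) - 5\epsilon_1 R^0_{\max}$$
$$\ge Q^*(x_1,a_1) - 6\epsilon_1 R^0_{\max}$$
$$= Q^*(x_1,a_1) - \epsilon R^0_{\max},$$

where we applied (in this order) the property that truncation decreases the value function; Eq. (15); our assumption; Eq. (14); Eq. (13); Eq. (16); Lemma B.4 and the definition of $\epsilon_1$.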



References 

Auer, P., & Ortner, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning algorithms. NIPS (pp. 49-56).

Brafman, R. I., & Tennenholtz, M. (2001). R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Proc. IJCAI (pp. 953-958).

Dearden, R. W. (2000). Learning and planning in structured worlds. Doctoral dissertation, University of British Columbia.

Even-Dar, E., & Mansour, Y. (2001). Convergence of optimistic and incremental Q-learning. NIPS (pp. 1499-1506).

Kearns, M., & Singh, S. (1998). Near-optimal reinforcement learning in polynomial time. Proc. ICML (pp. 260-268).

Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 209-232.

Koenig, S., & Simmons, R. G. (1993). Complexity analysis of real-time reinforcement learning. Proc. AAAI (pp. 99-105).

Littman, M. L., & Szepesvari, C. (1996). A generalized reinforcement-learning model: Convergence and applications. Proc. ICML (pp. 310-318). Morgan Kaufmann.

Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35, 117-154.

Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvari, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38, 287-308.

Strehl, A. L., & Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. Proc. ICML (pp. 856-863).

Strehl, A. L., & Littman, M. L. (2006). An analysis of model-based interval estimation for Markov decision processes. Submitted.

Strens, M. (2000). A Bayesian framework for reinforcement learning. Proc. ICML (pp. 943-950). Morgan Kaufmann, San Francisco, CA.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge.

Wiering, M. A., & Schmidhuber, J. (1998). Efficient model-based exploration. Proc. SAB: From Animals to Animats (pp. 223-228).